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Abstract 

We propose an efficient way to sample from a class of structured multivariate Gaussian 
distributions which routinely arise as conditional posteriors of model parameters that are 
assigned a conditionally Gaussian prior. The proposed algorithm only requires matrix 
operations in the form of matrix multiplications and linear system solutions. We exhibit 
that the computational complexity of the proposed algorithm grows linearly with the 
dimension unlike existing algorithms relying on Cholesky factorizations with cubic orders 
of complexity. The algorithm should be broadly applicable in settings where Gaussian 
scale mixture priors are used on high dimensional model parameters. We provide an 
illustration through posterior sampling in a high dimensional regression setting with a 
horseshoe prior (Garvalho et ah, 2010) on the vector of regression coefficients. 
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1 Introduction 


Continuous shrinkage priors have recently received significant attention as a mechanism to 
induce approximate sparsity in high-dimensional parameters. Such prior distributions can 
mostly be expressed as global-local scale mixtures of Gaussians (Bhattacharya et ah, 2015; 
Poison & Scott, 2010). These global-local priors (Poison & Scott, 2010) aim to shrink noise 
coefficients while retaining any signal, thereby providing an approximation to the operating 
characteristics of discrete mixture priors (George &: McGulloch, 1997; Scott & Berger, 2010), 
which allow a subset of the parameters to be exactly zero. 

A major attraction of global-local priors has been computational efficiency and simplic¬ 
ity. Posterior inference poses a stiff challenge for discrete mixture priors in moderate to 
high-dimensional settings, but the scale-mixture representation of global-local priors allows 
parameters to be updated in blocks via a fairly automatic Gibbs sampler in a wide variety of 
problems. These include regression (Armagan et ah, 2013; Caron &: Doucet, 2008), variable 
selection (Hahn &: Carvalho, 2015), wavelet denoising (Poison &: Scott, 2010), factor models 
and covariance estimation (Bhattacharya &: Dunson, 2011; Pati et ah, 2014), and time series 
(Durante et ah, 2014). Rapid mixing and convergence of the resulting Gibbs sampler for spe¬ 
cific classes of priors has been recently established in the high-dimensional regression context 
by Khare & Robert (2013) and Pal & Khare (2014). Moreover, recent results suggest that a 
subclass of global-local priors can achieve the same minimax rates of posterior concentration 
as the discrete mixture priors in high-dimensional estimation problems (Bhattacharya et ah, 
2015; Pati et ah, 2014; van der Pas et ah, 2014). 

In this article, we focus on computational aspects of global-local priors in the high¬ 
dimensional linear regression setting 

y = X/3 + e, e~N(0,a2R), (1) 

where X G is a n x p matrix of covariates with the number of variables p potentially 

much larger than the sample size n. A global-local prior on /3 assumes that 

/3j I Aj,r,cj ~ N(0,A^rV^), (j = l,...,p), (2) 

Ai~/, (j = l,...,p) (3) 

r ~ gr, cr ~ h, (4) 

where f,g and h are densities supported on (0,oo). The XjS are usually referred to as local 
scale parameters while r is a global scale parameter. Different choices of / and g lead to 
different classes of priors. For instance, a half-Gauchy distribution for / and g leads to the 
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horseshoe prior of Carvalho et al. (2010). In the p ^ n setting where most entries of 13 are 
assumed to be zero or close to zero, the choices of / and g play a key role in controlling the 
effective sparsity and concentration of the prior and posterior (Pati et ah, 2014; Poison & 
Scott, 2010). 

Exploiting the scale-mixture representation 2, it is straightforward in principle to formu¬ 
late a Gibbs sampler. The conditional posterior of (3 given A = (Ai, ..., XpY ,t and a is given 
by 

(3\y,\T,a^n{A-^X'^y,a^A-^), A = {X^X + K:^), A, = r2diag(A?,..., A^). (5) 

Further, the p local scale parameters Xj have conditionally independent posteriors and hence 
A = (Al,.. ., XpY can be updated in a block by slice sampling (Poison et ah, 2014) if condi¬ 
tional posteriors are unavailable in closed form. However, unless care is exercised, sampling 
from (5) can be expensive for large values of p. Existing algorithms (Rue, 2001) to sample 
from (5) face a bottleneck for large p to perform a Cholesky decomposition of A at each 
iteration. One cannot resort to precomputing the Cholesky factors since the matrix A* in (5) 
changes from at each iteration. In this article, we present an exact sampling algorithm for 
Gaussian distributions (5) which relies on data augmentation. We show that the computa¬ 
tional complexity of the algorithm scales linearly in p. 

2 The algorithm 

Suppose we aim to sample from Np(^, S), with 

S = (4>T$ + Zl-i)-i, = (6) 

where D G is symmetric positive definite, <I> G and a G (5) is a special 

case of (6) with <I> = X/a, D = ct^A* and a = yja. A similar sampling problem arises in all 
the applications mentioned in §1, and the proposed approach can be used in such settings. 
In the sequel, we do not require D to be diagonal, however we assume that D~^ is easy to 
calculate and it is straightforward to sample from N(0,iA). This is the case, for example, if 
D corresponds to the covariance matrix of an AR(g) process or a Gaussian Markov random 
field. 

Letting Q = = (4>'^<h -|- D~^) denote the precision, inverse covariance, matrix and 

b = <I>'^a, we can write p, = Q~^b. Rue (2001) proposed an efficient algorithm to sample 
from a X{Q~^b,Q~^) distribution that avoids explicitly calculating the inverse of Q, which 
is computationally expensive and numerically unstable. Instead, the algorithm in §3.1.2. of 
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Rue (2001) performs a Cholesky decomposition of Q and uses the Cholesky factor to solve 
a series of linear systems to arrive at a sample from the desired Gaussian distribution. The 
original motivation was to efficiently sample from Gaussian Markov random fields where Q 
has a banded structure so that the Cholesky factor and the subsequent linear system solvers 
can be computed efficiently. Since Q = + D~^) does not have any special structure 

in the present setting, the Cholesky factorization has complexity 0{p^); see §4.2.3 of Golub 
& Van Loan (1996), which increasingly becomes prohibitive for large p. We present an 
alternative exact mechanism to sample from a Gaussian distribution with parameters as in 
(6) below: 


Algorithm 1 Proposed algorithm 

(i) Sample u ~ N(0,Z1) and 5 ~ N(0,I„) independently. 

(ii) Set V = ^u + 5. 

(hi) Solve (<kL>4>'^ + \n)w = {a — v). 

(iv) Set 9 = u + D^'^w. 


Proposition 2.1. Suppose 6 is obtained by following Algorithm 1. Then, 6 ~ S), where 

p and T, are as in (6). 

Proof. By the Sherman-Morrison-Woodbury identity (Hager, 1989) and some algebra, p = 
+ I„)“^a. By construction, v ~ N(0, + In)- Combining steps (hi) and 

(iv) of Algorithm 1, we have 6 = u + + I„)“^(q; — v). Hence 9 has a normal 

distribution with mean +ln)~^a = p. Since cov(u, v) = we obtain cov(9) = 

D — = S, again by the Sherman-Morrison-Woodbury identity. This 

completes the proof; a constructive proof is given in the Appendix. □ 

While Algorithm 1 is valid for all n and p, the computational gains are biggest when pi§> n 
and N(0,D) is easily sampled. Indeed, the primary motivation is to use data augmentation 
to cheaply sample f = (u^,ii^)^ G and obtain a desired sample from (6) via linear 

transformations and marginalization. When D is diagonal, as in the case of global-local 
priors (2), the complexity of the proposed algorithm is O(n^p); the proof uses standard 
results about complexity of matrix multiplications and linear system solutions; see §1.3 & 3.2 
of Golub & Van Loan (1996). For non-sparse D, calculating has a worst-case complexity 
of 0{np^), which is the dominating term in the complexity calculations. In comparison 
to the 0{p^) complexity of the competing algorithm in Rue (2001), Algorithm 1 therefore 
offers huge gains when p n. For example, to run 6000 iterations of a Gibbs sampler 
for the horseshoe prior (Carvalho et ah, 2010) with sample size n = 100 in MATLAB on a 
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INTEL(E5-2690) 2.9 GHz machine with 64 GB DDRS memory, Algorithm 1 takes roughly 
the same time as Rue (2001) when p = 500 but offers a speed-up factor of over 250 when 
p = 5000. MATLAB code for the above comparison and subsequent simulations is available 
at https://github.com/antikO15/Fast-Sampling-of-Gaussian-Posteriors.git. 

The first line of the proof implies that Algorithm 1 outputs p if one sets n = 0,5 = 0 in 
step (i). The proof also indicates that the log-density of (6) can be efficiently calculated at any 
X G Indeed, since is readily available, x'^S~^x and are cheaply calculated 

and log|S“^| can be calculated in O(n^) steps using the identity |R -|- AB\ = IR -|- BA\ 
for A G W^^,B G Einally, from the proof, = a^^{D — 

which can be calculated in 

0{rA) operations. 

3 Frequentist operating characteristics in high dimensions 

The proposed algorithm provides an opportunity to compare the frequentist operating char¬ 
acteristics of shrinkage priors in high-dimensional regression problems. We compare various 
aspects of the horseshoe prior (Garvalho et al., 2010) to frequentist procedures and obtain 
highly promising results. We expect similar results for the Dirichlet-Laplace (Bhattacharya 
et ah, 2015), normal-gamma (Griffin & Brown, 2010) and generalized double-Pareto (Arma- 
gan et al., 2013) priors, which we hope to report elsewhere. 

We first report comparisons with smoothly clipped absolute deviation (Ean & Li, 2001) 
and minimax concave penalty (Zhang, 2010) methods. We considered model (1) with n = 200, 
p = 5000 and a = 1.5. Letting Xi denote the fth row of X, the XjS were independently 
generated from Np(0,S), with (i) S = R and (ii) = 1,'EjjT = 0.5, j / j'^ = l,...,p, 
compound symmetry. The true /3o had 5 non-zero entries in all cases, with the non-zero entries 
having magnitude (a) {1.5,1.75,2,2.25,2.5} and (b) {0.75,1,1.25,1.5,1.75}, multiplied by 
a random sign. Eor each case, we considered 100 simulation replicates. The frequentist 
penalization approaches were implemented using the R package ncvreg via 10-fold cross- 
validation. Eor the horseshoe prior, we considered the posterior mean and the point-wise 
posterior median as point estimates. Eigures 1 and 2 report boxplots for £i, \\j3 — /3o||i) 
R, 11/3 — /dolb, and prediction, \\Xj3 — X/Solb, errors across the 100 replicates for the two 
signal strengths. The horseshoe prior is highly competitive across all simulation settings, in 
particular when the signal strength is weaker. An interesting observation is the somewhat 
superior performance of the point-wise median even under an R loss; a similar fact has 
been observed about point mass mixture priors (Castillo & van der Vaart, 2012) in high 
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Figure 1: Boxplots of £i, £2 and prediction error across 100 simulation replicates. HSme and 
HSm respectively denote posterior point wise median and mean for the horeshoe prior. True 
/3o is 5-sparse with non-zero entries ±{1.5,1.75, 2, 2.25, 2.5}. Top row: S = Ip (independent). 
Bottom row: T,jj = = 0.5, j / (compound symmetry). 
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Figure 2: Same setting as in Fig 1. Trne /3o is 5-sparse with non-zero entries 
±{0.75,1,1.25,1.5,1.75}. 
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dimensions. We repeated the simulation with p = 2500 with similar conclusions. Overall, out 
of the 24 settings, the horseshoe prior had the best average performance over the simulation 
replicates in 22 cases. 

While there is now a huge literature on penalized point estimation, uncertainty charac¬ 
terization m p > n settings has received attention only recently (Javanmard & Montanari, 
2014; van de Geer et ah, 2014; Zhang & Zhang, 2014). Although Bayesian procedures provide 
an automatic characterization of uncertainty, the resulting credible intervals may not possess 
the correct frequentist coverage in nonparametric/high-dimensional problems (Szabo et ah, 
2015). This led us to investigate the frequentist coverage of shrinkage priors inp > n settings; 
it is trivial to obtain element-wise credible intervals for the /3jS from the posterior samples. 
We compared the horseshoe prior with van de Geer et al. (2014) and Javanmard & Montanari 
(2014), which can be used to obtain asymptotically optimal element wise confidence intervals 
for the 13jS. We considered a similar simulation scenario as before. We let p G {500,1000}, 
and considered a Toeplitz structure, for the covariate design (van de Geer 

et ah, 2014) in addition to the independent and compound symmetry cases stated already. 
The first two rows of Table 1 report the average coverage percentages and 100 x lengths of 
confidence intervals over 100 simulation replicates, averaged over the 5 signal variables. The 
last two rows report the same averaged over the {p — 5) noise variables. 

Table 1 shows that the horseshoe has a superior performance. An attractive adaptive 
property of shrinkage priors emerges, where the lengths of the intervals automatically adapt 
between the signal and noise variables, maintaining the nominal coverage. The frequentist 
procedures seem to yield approximately equal sized intervals for the signals and noise vari¬ 
ables. The default choice of the tuning parameter A x (logp/n)^/^ suggested in van de 
Geer et al. (2014) seemed to provide substantially poorer coverage for the signal variables 
at the cost of improved coverage for the noise, and substantial tuning was required to arrive 
at the coverage probabilities reported. The default approach of Javanmard & Montanari 
(2014) produced better coverages for the signals compared to van de Geer et al. (2014). 
The horseshoe and other shrinkage priors on the other hand are free of tuning parameters. 
The same procedure used for estimation automatically provides valid frequentist uncertainty 
characterization. 

4 Discussion 

Our numerical results warrant additional numerical and theoretical investigations into proper¬ 
ties of shrinkage priors in high dimensions. The proposed algorithm can be used for essentially 
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Table 1; Frequentist coverages (%) and lOOxlengths of point wise 95% intervals. Average 
coverages and lengths are reported after averaging across all signal variables (rows 1 and 2) 
and noise variables (rows 3 and 4). Subscripts denote 100 x standard errors for coverages. 
LASSO and SS respectively stand for the methods in van de Geer et al. (2014) and Javan- 
mard & Montanari (2014). The intervals for the horseshoe (HS) are the symmetric posterior 
credible intervals. 
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all the shrinkage priors in the literature and should prove useful in an exhaustive comparison 
of existing priors. Its scope extends well beyond linear regression. For example, extensions 
to logistic and probit regression are immediate using standard data augmentation tricks (Al¬ 
bert & Chib, 1993; Holmes & Held, 2006). Multivariate regression problems where one has a 
matrix of regression coefficients can be handled by block updating the vectorized coefficient 
matrix ; even if p < n, the number of regression coefficients may be large if the dimension 
of the response if moderate. Shrinkage priors have been used as a prior of factor loadings 
in Bhattacharya & Dunson (2011). While Bhattacharya &: Dunson (2011) update the p > n 
rows of the factor loadings independently, exploiting the assumption of independence in the 
idiyosyncratic components, their algorithm does not extend to approximate factor models, 
where the idiyosyncratic errors are dependent. The proposed algorithm can be adapted to 
such situations by block updating the vectorized loadings. Finally, we envision applications 
in high-dimensional additive models where each of a large number of functions is expanded 
in a basis, and the basis coefficients are updated in a block. 

Appendix 

Here we give a constructive argument which leads to the poroposed algorithm. By the 
Sherman-Morrison-Woodbury formula (Hager, 1989) and some algebra we have, 

E = = L»-Z14>^($D4>^-tI„)"^4>Li, p = LI4>^(4>D4>^-t I„)"^a. (7) 

However, the above identity for S on its own does not immediately help us to sample from 
N(0, S). Here is where the data augmentation idea is useful. Define C = (u^,rt^)^ G 
where u and v are defined in step (i) and (ii) of the algorithm. Clearly, C has a mean 




where P = + 


zero multivariate normal distribution with covariance = 

In), S = and R = D. The following identity is easily verified; 

/P / In 0\(P 0 Win P-^S\ 

Rj \s^p-^ ij VO R-S^p-^SJ\0 Ip )' 

' -'-V--V--V-" 

n L r lt 


P S 
R 


( 8 ) 


Note that T is block diagonal, with the lower pxp block of T given by P — S'^P~^S equaling 
S by (7). Further, L is invertible, with L~^ = ^ which is easily derived since 

L is lower triangular. Therefore, T = . 

We already have C which is a sample from N(0,fl). Defining clearly Q ~ 

N(0, T). Thus if we collect the last p entries of C* , they give us a sample from N(0, S). Some 
further algebra produces the algorithm. 
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